# COMPSCI 389: Introduction to Machine Learning
# Topic 5.4 Evaluation, One Last Time

**Note:** This notebook is described in the slides, `5.4 Evaluation Part 4.pdf`. All of the important content within this notebook is in those slides, so you are not responsible for this notebook. However, you may reference this notebook to run the examples from the slides.

The code below should be review. It:
1. Imports the libraries we use
2. Defines the evaluation metrics we use
3. Defines the KNearestNeighbors model
4. Defines the WeightedKNearestNeighbors model

In [1]:
import pandas as pd
from sklearn.neighbors import KDTree
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

def mean_squared_error(predictions, labels):
    return np.mean((predictions - labels) ** 2)

def root_mean_squared_error(predictions, labels):
    return np.sqrt(mean_squared_error(predictions, labels))

def mean_absolute_error(predictions, labels):
    return np.mean(np.abs(predictions - labels))

def r_squared(predictions, labels):
    ss_res = np.sum((labels - predictions) ** 2)      # ss_res is the "Sum of Squares of Residuals"
    ss_tot = np.sum((labels - np.mean(labels)) ** 2)  # ss_tot is the "Total Sum of Squares"
    return 1 - (ss_res / ss_tot)

class KNearestNeighbors(BaseEstimator):
    # Add a constructor that stores the value of k (a hyperparameter)
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Convert X and y to NumPy arrays if they are DataFrames
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        # Store the training data and labels
        self.X_data = X
        self.y_data = y

        # Create a KDTree for efficient nearest neighbor search
        self.tree = KDTree(X)

        return self

    def predict(self, X):
        # Convert X to a NumPy array if it's a DataFrame
        if isinstance(X, pd.DataFrame):
            X = X.values

        # Query the tree for the k nearest neighbors for all points in X
        dist, ind = self.tree.query(X, k=self.k)

        # Return the average label for the nearest neighbors of each query
        return np.mean(self.y_data[ind], axis=1)
 
class WeightedKNearestNeighbors(BaseEstimator):
    # Add a constructor that stores the value of k and sigma (hyperparameters)
    def __init__(self, k=3, sigma=1.0):
        self.k = k
        self.sigma = sigma

    def fit(self, X, y):
        # Convert X and y to NumPy arrays if they are DataFrames
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        # Store the training data and labels
        self.X_data = X
        self.y_data = y

        # Create a KDTree for efficient nearest neighbor search
        self.tree = KDTree(X)

        return self

    def gaussian_kernel(self, distance):
        # Gaussian kernel function
        return np.exp(- (distance ** 2) / (2 * self.sigma ** 2))

    def predict(self, X):
        # Convert X to a NumPy array if it's a DataFrame
        if isinstance(X, pd.DataFrame):
            X = X.values

        # We will iteratively load predictions, so it starts empty
        predictions = []

        # Loop over rows in the query
        for x in X:
            # Query the tree for the k nearest neighbors
            dist, ind = self.tree.query([x], k=self.k)

            # Calculate weights using the Gaussian kernel
            weights = self.gaussian_kernel(dist[0])

            # Check if the weights sum to zero. This happens when all points are
            # very far away, giving weights that round to zero, which would cause
            # division by zero later. In this case, revert to un-weighted
            # (all weights are one).
            if np.sum(weights) == 0:
                # If weights sum to zero, assign equal weight to all neighbors
                weights = np.ones_like(weights)

            # Weighted average of the labels of the k nearest neighbors
            weighted_avg_label = np.average(self.y_data[ind[0]], weights=weights)
            predictions.append(weighted_avg_label)

        # Return the array of predictions we have created
        return np.array(predictions)
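
As a quick check of the weighting used in `WeightedKNearestNeighbors.predict`, the sketch below applies the same Gaussian kernel formula to three neighbors. The distances, labels, and the standalone `gaussian_kernel` function are made up for illustration; they are not taken from the GPA data.

```python
import numpy as np

def gaussian_kernel(distance, sigma=1.0):
    # Same formula as WeightedKNearestNeighbors.gaussian_kernel
    return np.exp(-(distance ** 2) / (2 * sigma ** 2))

# Hypothetical distances to three neighbors, and those neighbors' labels
distances = np.array([0.0, 1.0, 2.0])
labels = np.array([4.0, 3.0, 1.0])

weights = gaussian_kernel(distances, sigma=1.0)
weighted_prediction = np.average(labels, weights=weights)
unweighted_prediction = np.mean(labels)

print("weights:", np.round(weights, 4))
print("weighted prediction:", round(float(weighted_prediction), 3))
print("unweighted prediction:", round(float(unweighted_prediction), 3))
```

Because the closest neighbor receives the largest weight, the weighted prediction is pulled toward its label relative to the plain average.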

## Algorithm Evaluation

Notice that the discussion so far has focused on using a test set to evaluate a single model that was trained on one data set. This captures our uncertainty about the performance of that particular learned model. If we ran the algorithm many times on different training sets, we could obtain models of different quality. The true MSE of each model could differ! Our analysis so far did not capture this.

The analysis above is useful for testing how much you can trust a specific model, but less useful for comparing algorithms in general. To compare algorithms, we can do the following:
- Specify a number of trials, `num_trials`
- For each trial $i$ in $1,...,\text{num\_trials}$ do:
 - Sample a data set (ideally independent of the data sets for other trials)
 - Split the data set into training and testing sets
 - Use the ML algorithm to train a model on the training set.
 - Use the trained model to make predictions for the testing set.
 - Compute the sample performance metric (e.g., sample MSE) for the test set. Call this $Z_i$.
- Compute and report the average sample MSE.
- Compute and report the standard error of $Z_1,\dotsc,Z_\text{num\_trials}$.

This standard error incorporates uncertainty due to both the sample MSE and the varying MSE of the learned models.
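
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the data-generating process in `sample_data_set` is invented, and a least-squares line fit stands in for the ML algorithm. Neither is part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data_set(n):
    # Hypothetical data-generating process: y = 2x + Gaussian noise
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)
    return x, y

def train(x_train, y_train):
    # Stand-in ML algorithm: least-squares line fit
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return lambda x: slope * x + intercept

num_trials = 50
Z = np.empty(num_trials)
for i in range(num_trials):
    # Each trial draws a fresh, independent data set
    x, y = sample_data_set(200)
    # First 150 points train, remaining 50 test (points are already i.i.d.)
    model = train(x[:150], y[:150])
    predictions = model(x[150:])
    Z[i] = np.mean((predictions - y[150:]) ** 2)  # sample MSE for trial i

average_mse = np.mean(Z)
standard_error = np.std(Z, ddof=1) / np.sqrt(num_trials)
print(f"Average sample MSE: {average_mse:.3f} ± {standard_error:.3f}")
```

Each `Z[i]` comes from a different training set and a different test set, so the spread of `Z` reflects both sources of variability.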

## Cross-Validation

Notice that we can't easily do this using the GPA data set, since we can't generate `num_trials` independent data sets (unless we consider data sets much smaller than our actual data set).

Cross-validation is a technique that resolves this by repeatedly splitting the same data set into different training and testing sets. The most common version is $k$-fold cross-validation, which operates as follows.

- **Input:** Dataset `D`, Number of folds `k`, Machine Learning Algorithm `ML_Algo`
- **Output:** Cross-validated performance estimate

Procedure:

1. Split `D` into `k` (approximately) equal-sized subsets (folds) `F1, F2, ..., Fk`. When the size of `D` is not divisible by `k`, fold sizes differ by at most one.
2. For `i` from 1 to `k`:
 - Set aside fold `Fi` as the validation set, and combine the remaining `k-1` folds to form a training set.
 - Train the model `M` using `ML_Algo` on the `k-1` training folds.
 - Evaluate the performance of model `M` on the validation fold `Fi`. Store the performance metric `P_i`.
3. Calculate the average of the performance metrics: `Average_Performance = mean(P_1, P_2, ..., P_k)`.
4. Optionally, calculate other statistics (like standard deviation or standard error) of the performance metrics across the folds.
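
The steps above can be written out by hand in a few lines of NumPy. The sketch below is illustrative only: `k_fold_indices`, `cross_validate`, and `mean_predictor` are hypothetical helper names, the data are synthetic, and a predictor that always outputs the mean training label stands in for `ML_Algo`.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    # Step 1: shuffle indices 0..n-1 and split them into k nearly equal folds
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, k, fit_predict, rng):
    folds = k_fold_indices(len(y), k, rng)
    scores = []
    for i in range(k):
        # Step 2: fold i is the validation set; the other k-1 folds train
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        predictions = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean((predictions - y[test_idx]) ** 2))  # P_i (MSE)
    return np.array(scores)

def mean_predictor(X_train, y_train, X_test):
    # Placeholder for ML_Algo: always predict the mean training label
    return np.full(len(X_test), np.mean(y_train))

rng = np.random.default_rng(0)
X = rng.normal(size=(103, 2))                   # synthetic features
y = X[:, 0] + rng.normal(scale=0.1, size=103)   # synthetic labels

P = cross_validate(X, y, k=5, fit_predict=mean_predictor, rng=rng)
print("Average performance:", P.mean())                     # Step 3
print("Standard error:", P.std(ddof=1) / np.sqrt(len(P)))   # Step 4
```

With 103 points and 5 folds, `np.array_split` produces folds of size 21, 21, 21, 20, and 20, so every point is used for validation exactly once.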

One notable variant of k-fold cross-validation is **leave-one-out (LOO) cross-validation**, which sets `k` equal to the size of the data set so that each fold is a single point.
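
A minimal sketch of leave-one-out cross-validation, assuming a small synthetic 1-D data set and a 1-nearest-neighbor predictor as the stand-in model (both are illustrative choices, not the notebook's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)           # hypothetical 1-D feature
y = 3.0 * x + rng.normal(0.0, 0.1, size=30)  # hypothetical labels

n = len(x)
squared_errors = np.empty(n)
for i in range(n):
    # Fold i holds out point i; the other n-1 points form the training set
    train = np.arange(n) != i
    # Stand-in model: predict the label of the nearest training point
    nearest = np.argmin(np.abs(x[train] - x[i]))
    squared_errors[i] = (y[train][nearest] - y[i]) ** 2

print(f"LOO estimate of MSE: {squared_errors.mean():.4f}")
```

Note that LOO requires fitting the model once per data point, which can be expensive for large data sets or slow training algorithms.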

Scikit-Learn has a useful function [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html), which simplifies creating folds.



In [2]:
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
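# Note: this import replaces the mean_squared_error defined in the first cell.
# Scikit-Learn's version takes its arguments in the order (y_true, y_pred).
from sklearn.metrics import mean_squared_error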
import numpy as np

# Load the data set
df = pd.read_csv("data/GPA.csv", delimiter=',')

# Separate the features X (all columns but the last) from the labels y (the last column)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Define the model
model = WeightedKNearestNeighbors(k=300, sigma=100)

# Choose number of folds for k-fold Cross-Validation
k = 20
kf = KFold(n_splits=k, shuffle=True, random_state=1)

# Function to compute MSE for each fold
def mse_for_fold(train_index, test_index, model, X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_squared_error(y_test, predictions)

# Compute MSE for each fold
mse_scores = [mse_for_fold(train_index, test_index, model, X, y) for train_index, test_index in kf.split(X)]

# Calculate the average MSE and standard error
average_mse = np.mean(mse_scores)
mse_standard_error = np.std(mse_scores, ddof=1) / np.sqrt(k)

print(f"Average MSE: {average_mse:.3f}")
print(f"MSE Standard Error: ±{mse_standard_error:.3f}")


Average MSE: 0.571
MSE Standard Error: ±0.004
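
For reference, the two printed quantities correspond to the formulas below, where $P_i$ is the MSE on fold $i$ and $k$ is the number of folds. The `ddof=1` argument to `np.std` gives the $k-1$ in the denominator. One caveat worth keeping in mind: the training sets of different folds overlap heavily, so the $P_i$ are not independent and this standard error should be read as an approximation.

$$
\bar{P} = \frac{1}{k}\sum_{i=1}^{k} P_i,
\qquad
\mathrm{SE} = \frac{1}{\sqrt{k}}\sqrt{\frac{1}{k-1}\sum_{i=1}^{k}\left(P_i - \bar{P}\right)^2}
$$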


The code below helps visualize how `KFold` is used: for each fold, it prints the training indices, the test indices, and the MSE on that fold.

In [3]:
display(kf)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    mse_score = mse_for_fold(train_index, test_index, model, X, y)
    print("MSE Score for this fold:", mse_score)

KFold(n_splits=20, random_state=1, shuffle=True)

TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 10 44 45 ... 43267 43290 43296]
MSE Score for this fold: 0.5807234989808185
TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 40 93 134 ... 43246 43256 43261]
MSE Score for this fold: 0.5630048290694765
TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 3 23 34 ... 43262 43286 43288]
MSE Score for this fold: 0.5553467010840363
TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 8 19 25 ... 43277 43293 43299]
MSE Score for this fold: 0.6129428000450592
TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 11 33 58 ... 43255 43282 43292]
MSE Score for this fold: 0.5933726084007112
TRAIN: [ 0 1 3 ... 43300 43301 43302] TEST: [ 2 22 24 ... 43271 43284 43298]
MSE Score for this fold: 0.5644141827789226
TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 21 36 55 ... 43241 43249 43275]
MSE Score for this fold: 0.573666751853279
TRAIN: [ 0 1 2 ... 43300 43301 43302] TEST: [ 26 46 62 ... 43223 43231 43265]
MSE Score for this fold: 0.5764896819599702
TRAIN: [ 0 1 2 ... 43300 43